1. Introduction


Brain-related disorders are among the worst consequences of aging. Although I never met
my great-grandmother, I have heard stories about her dementia, primarily memory loss, in
her final years. It transformed her into a completely different person, someone who would
not even recognise her relatives. This showed me how important our brain health is and
how any problem with it can be life-changing.


In search of more answers about the brain, I visited the Champalimaud Foundation in
Lisbon, more specifically the Nuclear Medicine department. They were working on
detecting brain-related disorders such as Parkinson's disease and Alzheimer's disease.
Most interestingly, they showed the possibility of using machine learning to detect
those specific diseases. Prompted by my great-grandmother's story and my visit to
Champalimaud, I decided to investigate the detection of brain-related disorders with
the help of machine learning.


The department kindly shared with me a refined dataset containing dimensional features
of the striatum for patients with and without Parkinson's Disease (PD). As they
explained, there is a strong correlation between these features and whether a patient
has PD, which makes the dataset suitable for a machine learning model. This dataset was
used for the experiment in this essay.


According to the National Institute on Aging [1], PD primarily develops in people older
than 60. Its main symptoms include unintended and uncontrolled movements such as
shaking. Dopamine transporter (DaT) loss in the brain is a key feature of PD and results
in these symptoms [2]. A scan that combines SPECT imaging with DaTSCAN is a common way
to evaluate DaT levels in the brain [3].


Figure 1 shows an example of such a scan: the left scan shows a healthy subject and the
right scan shows a PD patient. The bright yellow-red-blue regions represent healthy
cells containing DaT. As one can see, the PD patient has a clear decrease in healthy
cells with DaT. Those regions also indicate the size of the striatum, the region of the
brain that controls movement; in PD patients the striatum dimensions therefore appear
smaller. This explains the correlation between the striatum dimensional features in the
dataset and the PD diagnosis.


Figure 1 - DaTSCAN for normal vs PD patients [3]


Visual examination of the dimensions of the striatum is not new; it is frequently used
for the final diagnosis of possible PD patients. As seen in Figure 2, the width, length
and thickness of the striatum can be extracted from a 3D scan. However, for medical
staff this can be time-consuming, and in certain cases it is hard to decide objectively
whether the striatum dimensions are abnormal. Different quantification methods,
including machine learning, have been developed to help medical staff make more
objective assessments.


Figure 2 - Width, Length and Thickness of segmented Striatum


Machine learning can increase the accuracy of automated diagnosis. Machine learning
algorithms can consider many features at the same time, which helps them achieve high
accuracy. An accurate machine learning model can help detect dopamine transporter loss
early and therefore assist the clinical decision in diagnosing PD. Spotting the disease
early is important because treatments such as levodopa/carbidopa are more effective at
that stage [4].


This work aims to compare two machine learning algorithms: k-nearest neighbour (k-NN)
and Naive Bayes (NB). More specifically, it asks: "How does the k-nearest neighbour
algorithm compare to the Naive Bayes algorithm in diagnosing Parkinson's Disease when
using striatum dimensional features as input data?" These algorithms were chosen for
their simplicity and quick implementation; they require little computational power,
which allowed me to use my personal computer for the experiment. Because the algorithms
are relatively basic, the experiment demonstrates whether they have potential for PD
diagnosis and, if so, which of the two is better. Three features related to the
dimensions of the striatum were considered: length, width and thickness. The algorithms
were trained and tested using 10-fold cross-validation; the results were stored in
confusion matrices, which were then used to calculate several metrics to evaluate and
compare the models in more detail. All human data studies in this work were performed
in accordance with the ethical standards laid out by the IB.


2. Theoretical Background 


A. Machine Learning 


Machine learning is a branch of computer science and artificial intelligence (AI) that
focuses on imitating the way humans learn, gradually improving its accuracy over time.
This process of learning is also referred to as training the algorithm. To create a
machine learning model, a combination of data and algorithms is used [5]. By "data" I
refer to the inputs that the algorithms process to produce an "output". Different
"algorithms" differ in the way they process the "data", both in training and in testing.
The terms "algorithm" and "model" will be used interchangeably in this work. The final
"output" depends on whether the algorithm used is a supervised or an unsupervised
learner.


Figure 3 - Machine Learning 


B. Training and testing 


Training is an important procedure in machine learning: the algorithm in the model
adapts so that it can perform a certain task as successfully as possible. Usually, a
model performs one kind of task; in this work, that task is diagnosing a subject.


After the model is "trained", it is "tested" to see how well it performs. This is done
by giving the trained model data it has not previously seen. For example, if the models
in this work are trained on subjects 1-400, we could test them on subject 401 and see
whether they correctly classify the subject.

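
The idea of holding back unseen data can be sketched in a few lines of Python. This is
only an illustration: the subject IDs are made up, and the helper name
`split_train_test` and its parameters are assumptions, not part of the experiment.

```python
import random

def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle the records and hold back a fraction of them for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled records are kept unseen; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical subject IDs standing in for real patient records.
subjects = list(range(1, 11))
train, test = split_train_test(subjects)
print(len(train), len(test))
```

No subject appears in both lists, so the test subjects are genuinely new to the model.
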
C. Machine learning categories 


Machine learning algorithms are split into two categories based on their training
method: supervised and unsupervised learning. The output of a supervised learning model
is a prediction based on the input data, for example whether an email is spam or not.
The "input data" in such a model could be features of the email: number of words, types
of words, etc. However, training a supervised algorithm requires experts who can "label"
the data properly before the training stage [6]. A "label" is the correct tag for a
piece of data; in the email example, the "label" is a tag that classifies the email as
spam or not spam.


On the other hand, unsupervised algorithms are used to gain new insights from large
amounts of unlabelled input data. As such, there is often no specified output for
unsupervised learning algorithms.


The data I am using is already properly labelled by experts, which means every subject
already has a label stating whether he or she has PD. The model needs to predict whether
a subject has PD. For the purposes of this essay, supervised machine learning will
therefore be used; it is explained in more detail below.


D. Supervised machine learning 


What defines supervised learning is its use of labelled datasets to train the model.
The model is trained to do a certain task, such as identifying a disease. According to
javatpoint.com, "The aim of a supervised learning algorithm is to find a mapping
function to map the input variable (x) with the output variable (y)." [7]


[Diagram panels: Labeled Data, Labels, Model Training, Test Data, Prediction (Square,
Triangle)]


Figure 4 - Working of a Supervised Learning Algorithm [7]


Looking at Figure 4, this supervised learning algorithm is trained to identify three
types of shapes: hexagon, triangle and square. When training, the model receives the
data x (the shape image) and the label y ("square" / "triangle" / "hexagon"). The model
looks for patterns that allow it to classify each shape. For instance, the "square" has
4 equal sides, the "triangle" has 3 sides, and so on. After the training process is
complete, we can test the model with test data (similar but previously unseen) and find
out how well it performs.


Supervised learning can be further split into two subcategories: regression and
classification.


Regression algorithms are used to find the relationship between dependent and
independent variables. For example, they could be used to make projections such as the
sales revenue of a given business. Regression is thus the task of predicting a
continuous quantity [8]. Common examples of regression algorithms are linear regression
and polynomial regression [9].


Classification algorithms assign data to specific categories. The previously mentioned
'shape identifier' model is a good example: it puts each shape into a specific category.
Classification is the task of predicting a discrete class label [8]. Support Vector
Machine (SVM), k-nearest neighbour (k-NN) and random forest are popular classification
algorithms.


For the purposes of this work, supervised classification algorithms are the best choice
since we want to classify the subjects into two categories: positive for PD or negative
for PD. I will use the terms "classifier" and "algorithm" interchangeably.


E. K-Nearest Neighbour Algorithm 


The k-NN algorithm is a non-parametric (it makes no assumptions about the underlying
data distribution), supervised learning classifier that uses proximity to classify data
points. It is also a lazy learner, which means it does not build a model from the
training data in advance; instead it stores the data and, at classification time,
compares new data against it.


Imagine a model is built to distinguish dogs from cats, and the only two variables we
have are length of ears (X) and sharpness of claws (Y). Figure 5 below shows what this
would look like.


[Scatter plot: x-axis "Length of ears", y-axis "Sharpness of claws"]


Figure 5 - Dog and Cat classifier algorithm [10]


As you can see, the "Cat" class has sharper claws and shorter ears, whereas the "Dog"
class has longer ears but less sharp claws. This is essentially a k-NN model after the
"training" is complete: the input data is plotted and each data point is labelled as a
dog or a cat. Next, imagine we have a query point (red dot) which we want to classify as
a dog or a cat based on these two features. Because the query point has more dog
neighbours, it is classified as a dog. This concept is also visualised in Figure 6
below.


Figure 6 - k-NN clustering [11]


To recap, the goal of the k-NN algorithm is to identify the nearest neighbours of a
query point so that it can be assigned to the class most common among them. To do that,
the algorithm has two requirements: choosing the k-value and choosing a distance metric.
The k-value specifies the number of neighbours that are checked to classify the query
point [12]. Figure 7 demonstrates the importance of the k-value. Once the k-value is
set, an imaginary circle can be visualised that captures the k nearest neighbours. When
k=3, there are two Class B neighbours and one Class A neighbour, hence the query point
is labelled as Class B because there is a Class B majority. But if k=7, the majority is
Class A, hence the query point is labelled as Class A.


Figure 7 - Example of k-NN classification [10]


This demonstrates how choosing the value of k is a balancing act, as different values
may lead to different classifications. The best k-value can depend largely on the size
of the dataset. For two classes, an odd value of k is recommended so as to avoid ties.
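
The effect of the k-value can be illustrated with a minimal majority-vote sketch. The
neighbour labels are hypothetical and are assumed to be already sorted from nearest to
farthest; the function name `majority_vote` is chosen here for illustration only.

```python
from collections import Counter

def majority_vote(neighbour_labels, k):
    """Return the most common label among the k nearest neighbours."""
    votes = Counter(neighbour_labels[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labels, sorted from nearest to farthest neighbour.
# The nearest three hold two B and one A; the nearest seven hold four A and three B.
sorted_labels = ["B", "A", "B", "A", "A", "A", "B"]

print(majority_vote(sorted_labels, k=3))  # B
print(majority_vote(sorted_labels, k=7))  # A
```

With an even k such as 4, the vote here would be split two against two, which is why an
odd k is preferred for a two-class problem.
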


To assign a query point to a class, the distance between the query point and the other
data points needs to be calculated. The measured distances identify the neighbours,
which in turn determine the class of the query point.


There are many ways of measuring the distance between points. For the purposes of this
work, Euclidean distance will be used since it is the most commonly used distance
metric. Using the formula below, the length of a straight line between the query point
x and another point y is measured, where n is the number of features:


$$d(x, y) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2}$$

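
As a small worked example, the Euclidean distance can be computed as below. The two
points are hypothetical (width, length, thickness) values, not taken from the dataset.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two points with the same number of features."""
    return math.sqrt(sum((yi - xi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical (width, length, thickness) values for a query point and one neighbour.
query = (20.0, 30.0, 25.0)
neighbour = (23.0, 34.0, 25.0)

# Differences are 3, 4 and 0, so the distance is sqrt(9 + 16 + 0).
print(euclidean_distance(query, neighbour))  # 5.0
```
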
F. Naive Bayes


Naive Bayes algorithms are a set of supervised learning algorithms based on applying
Bayes' theorem with the "naive" assumption of conditional independence between every
pair of features given the value of the class variable [13].


Consider a dataset that describes the conditions for playing golf (Figure 8), where the
output is "Yes" or "No" for playing golf. The deciding features, or X variables, for
playing golf are "Outlook", "Temperature", "Humidity" and "Windy".


   Outlook   Temperature  Humidity  Windy  Play Golf
0  Rainy     Hot          High      False  No
1  Rainy     Hot          High      True   No
2  Overcast  Hot          High      False  Yes
3  Sunny     Mild         High      False  Yes
4  Sunny     Cool         Normal    False  Yes
5  Sunny     Cool         Normal    True   No
6  Overcast  Cool         Normal    True   Yes


Figure 8 - Fictional golf dataset 


The fundamental Naive Bayes assumption is that each X variable makes an independent and
equal contribution to the output. In relation to our dataset, independence means that no
X variable depends on another; for example, a "Hot" temperature is assumed to say
nothing about the humidity. Secondly, since all features contribute equally, knowing
only the outlook and the temperature is not enough to give an accurate prediction. Even
though these assumptions are generally not correct in real-life situations, the
algorithm often works well in practice.
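
To make the independence assumption concrete, the sketch below classifies one
hypothetical day using the seven rows of the fictional golf dataset. It uses simple
category frequencies rather than the Gaussian variant used in the experiment, and the
function name `naive_bayes_score` is an illustrative assumption.

```python
from fractions import Fraction

# Rows of the fictional golf dataset: (Outlook, Temperature, Humidity, Windy, Play Golf).
rows = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
]

def naive_bayes_score(query, label):
    """Prior of the label multiplied by each feature's frequency within that label."""
    matching = [r for r in rows if r[-1] == label]
    score = Fraction(len(matching), len(rows))  # prior P(label)
    for i, value in enumerate(query):
        count = sum(1 for r in matching if r[i] == value)
        score *= Fraction(count, len(matching))  # P(feature i = value | label)
    return score

query = ("Sunny", "Cool", "Normal", False)  # hypothetical day to classify
scores = {label: naive_bayes_score(query, label) for label in ("Yes", "No")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Each feature contributes its own factor to the product regardless of the other
features, which is exactly the "naive" independence assumption.
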
The specific algorithm used in this experiment was the Gaussian NB classifier, in which
the likelihood of the features is assumed to be Gaussian; hence, the conditional
probability is given by:


$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

where $\mu_y$ and $\sigma_y^2$ are the mean and variance of feature $x_i$ within class $y$.


Figure 9 - Gaussian conditional probability [14]
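
The Gaussian likelihood that this kind of classifier relies on can be written as a
short function. The class means and variances below are invented for illustration and
are not statistics of the real dataset.

```python
import math

def gaussian_likelihood(x, mean, variance):
    """Gaussian density of feature value x for a class with the given mean and variance."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)

# Hypothetical per-class statistics for a single feature (e.g. striatum width).
hc_mean, hc_var = 24.0, 9.0
pd_mean, pd_var = 18.0, 9.0

width = 23.0
hc_likelihood = gaussian_likelihood(width, hc_mean, hc_var)
pd_likelihood = gaussian_likelihood(width, pd_mean, pd_var)
print(hc_likelihood > pd_likelihood)  # True: 23.0 lies closer to the HC mean
```

In a full classifier, one such likelihood per feature would be multiplied together with
the class prior, and the class with the larger product would be predicted.
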


3. Evaluating machine learning algorithms 
A. Confusion matrix 
A detailed evaluation technique used for ML algorithms is the confusion matrix. It is a
table which gives insight into the types of errors the model is making and allows
other, more specific metrics to be calculated.


As seen in Figure 10 below, the matrix has two axes: "predicted" (horizontal axis) and
"actual" (vertical axis). 0 stands for a healthy control (HC) subject and 1 for a PD
subject. True Negative (TN) holds the number of correctly predicted negatives. True
Positive (TP) holds the number of correctly predicted positives. False Negative (FN)
holds the number of positives incorrectly predicted as negative, and False Positive
(FP) holds the number of negatives incorrectly predicted as positive. Generally, you
want to minimise both FP and FN. However, in some scenarios minimising one is more
important than the other. For example, a metal detector should produce no False
Negatives, since not detecting a gun may cost many lives. By contrast, a spam detector
should keep False Positives low, since it would be very annoying for the user to have
to search for an important email in the spam folder.


In my scenario it would be best to have a low number of False Negatives since, as
stated earlier, if the disease is spotted early, medication can be administered to
reduce the total damage of the disease. However, a low number of False Positives is
also important, since it avoids the cost of administering medication that is not
required.


            Predicted 0   Predicted 1
Actual 0    TN            FP
Actual 1    FN            TP


Figure 10 - Confusion Matrix example
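
The four cells can be counted directly from paired lists of actual and predicted
labels. The labels below are a made-up example; `confusion_counts` is a hypothetical
helper, not a function from the experiment's code.

```python
def confusion_counts(actual, predicted):
    """Count TN, FP, FN and TP for labels where 0 = HC and 1 = PD."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 0 and p == 0:
            tn += 1  # healthy subject correctly predicted as healthy
        elif a == 0 and p == 1:
            fp += 1  # healthy subject wrongly predicted as PD
        elif a == 1 and p == 0:
            fn += 1  # PD subject wrongly predicted as healthy
        else:
            tp += 1  # PD subject correctly predicted as PD
    return tn, fp, fn, tp

# Hypothetical labels for eight subjects.
actual = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))  # (3, 1, 1, 3)
```
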


B. Evaluation metrics 
Firstly, the most basic evaluation metric is the classification accuracy. As the name
suggests, it is the fraction of correct predictions out of the total number of
predictions, and it is defined by the simple formula below.


$$\text{classification accuracy} = \frac{\text{correct predictions}}{\text{total predictions}}$$


However, this metric is very basic and does not tell us much about what kinds of errors
the model is making.


Sensitivity is the probability of testing positive for diseased patients. It will be
used to determine whether the models are sufficiently sensitive to pick up the disease.


$$\text{Sensitivity} = \frac{TP}{TP + FN}$$


Specificity refers to the probability of testing negative for non-diseased patients,
i.e. it represents the proportion of patients without the disease who have a negative
test result.


$$\text{Specificity} = \frac{TN}{FP + TN}$$


Finally, the Matthews Correlation Coefficient (MCC) will be included. Some might argue
that the F1 score should be included since it is one of the most widely used metrics for
evaluating classification models. However, research shows that it is not as reliable as
MCC, so it will not be included in this work [15].


$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$


In the MCC formula we can see a balanced consideration of all four cells of the
confusion matrix, unlike sensitivity or specificity, which each consider only two cells.
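
The four metrics can be computed from the confusion matrix cells as sketched below. The
counts are hypothetical and chosen only to make the arithmetic easy to follow; the
function name `evaluate` is an assumption for illustration.

```python
import math

def evaluate(tn, fp, fn, tp):
    """Accuracy, sensitivity, specificity and MCC from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    # Note: the MCC denominator is zero if any row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}

# Hypothetical counts for 100 subjects.
metrics = evaluate(tn=40, fp=10, fn=5, tp=45)
print(metrics)
```

Here accuracy is 0.85, sensitivity 0.90 and specificity 0.80, while MCC (about 0.70)
summarises all four cells in a single value.
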


C. K-fold cross validation 


Finally, the models will be evaluated on their ability to generalise, that is, on
whether they perform well with different training data. This will be done by performing
k-fold cross-validation, more specifically 10-fold cross-validation, which is explained
below.


First the dataset is randomly shuffled to reduce bias, and then it is split into 10
folds, as seen in Figure 11.


Figure 11 - 10-fold cross-validation 


Initially, 9 folds are used to train the models and 1 fold to test them, and the
predictions for the test fold are recorded. The procedure is then repeated until every
fold has been used for testing exactly once (Figure 12).
Figure 12 - 10-fold cross validation
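
The shuffle-and-split procedure can be sketched with plain Python. This is an
illustrative version only, with an assumed helper name `k_fold_indices`; the experiment
itself relied on library routines for cross-validation.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and yield (train, test) index lists for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i, test in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 652 is the number of subjects in the dataset used in this essay.
splits = list(k_fold_indices(652, k=10))
print([len(test) for _, test in splits])
```

Each subject ends up in exactly one test fold, so every subject is predicted once by a
model that did not see it during training.
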


4. Hypothesis 


I hypothesise that the k-NN algorithm will perform best. I base this hypothesis
primarily on the Naive Bayes assumption of independence between all features, which is
unlikely to hold in this case. The dimensional features of the striatum are expected to
be closely related to each other. For example, as width decreases, thickness and length
may also decrease, because the striatum does not shrink in one dimension only but in
all three.


5. Methodology 


A. Dataset 


The dataset (Figure 13) used in the experiment was obtained with the help of Olivera et
al. [16], who in turn extracted all the features from images obtained from the
Parkinson's Progression Markers Initiative database [17]. The dataset contains 652
subjects in four groups: control female (73), control male (136), PD female (157) and
PD male (286). Overall, the age of the healthy control (HC) subjects was 61.8 ± 11.3
years, and the age of the PD subjects was 61.7 ± 9.7 years.


Each row holds the data for a different subject. The Y values are in the first column
of Figure 13, which stores the real diagnosis of the subject, where 0 is for HC and 1 is
for PD. The X values in columns 2-4 store the dimensional features of the striatum for
each subject: the Width, Length and Thickness of the striatum, as in Figure 2.

Diagnosis  Width  Length  Thickness
0          23.98  39.16
0          28.69  34.83
0          23.23  36.40
0          23.17  35.96
0          27.93  35.24
0          23.14  35.61
0          19.25  31.60
0          30.22  33.42
0          20.82  29.63
0          30.16  38.12
0          19.97  33.17
0          23.01  33.29
0          18.59  26.52
0          25.46  33.64
0          20.78  30.88
0          22.38  33.26
0          27.09  33.70
0          25.58  36.93
0          23.04  33.26
0          26.99  34.52

(The Thickness column of the snapshot is not legible in this copy.)


Figure 13 - Snapshot of Dataset


B. Experimental Procedure


1. Use Python to extract the X and Y values from the dataset.


2. Experiment with different values of k to find the one that gives the best accuracy.

3. Create the k-NN and NB models using the sklearn library.

4. Perform 10-fold cross-validation on each model and store all the outputs of each
model in two separate confusion matrices.

5. Store the accuracy of each fold for both models in an array.

6. Find the average value of accuracy, specificity, sensitivity and MCC for each model.

7. Show all the metrics in tables for easier visual comparison. The percentages range
from 0 to 100%, while MCC ranges from -1 to +1, with the extreme values of -1 and +1
reached in the case of perfect misclassification and perfect classification
respectively.
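
Step 2 of the procedure could look roughly like the sketch below. It runs on synthetic
stand-in data rather than the study's dataset: the class means, the spread and the
range of k values tried are all assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the study's dataset): three features per subject,
# drawn around made-up class means for HC (0) and PD (1).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[24.0, 34.0, 26.0], scale=3.0, size=(100, 3)),
    rng.normal(loc=[18.0, 24.0, 20.0], scale=3.0, size=(100, 3)),
])
y = np.array([0] * 100 + [1] * 100)

# Try odd values of k and record the mean 10-fold cross-validation accuracy of each.
mean_accuracy = {}
for k in range(1, 22, 2):
    model = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    mean_accuracy[k] = cross_val_score(model, X, y, cv=10).mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
print(best_k, round(mean_accuracy[best_k], 3))
```

Only odd values are tried so that a two-class vote cannot end in a tie.
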
6. Results and Analysis


Confusion Matrix of KNN

                    Predicted False   Predicted True
True label False    199 (30.52%)      10 (1.53%)
True label True     9 (1.38%)         434 (66.56%)


Confusion Matrix of Naive Bayes

                    Predicted False   Predicted True
True label False    194 (29.75%)      15 (2.30%)
True label True     42 (6.44%)        401 (61.50%)


Figure 14 - k-NN and Naive Bayes confusion matrices


To begin with, the confusion matrices in Figure 14 provide the most direct illustration
of the models' performance by indicating the number of true and false predictions in
each class. I will refer to a subject with PD as positive and a subject without PD as
negative. The left confusion matrix combines the outputs from all 10 folds for k-NN; the
right one does the same for NB.


Both models have a very high number of True Positives and True Negatives. For k-NN,
66.56% of all predictions are true positives and 30.52% are true negatives, which sum to
an accuracy of 97.08%. This is a high score; it shows that most patients were diagnosed
correctly. Similarly, Naive Bayes has a high share of true positives (61.50%) and true
negatives (29.75%), giving an accuracy of 91.25%. Overall, however, Naive Bayes
performed slightly worse: its share of true positives is 5.06 percentage points lower
than that of k-NN. This is because it produced many false negatives (6.44%), which is a
problem because the goal of testing is to detect the disease and give medication to
patients as early as possible.


Looking at the accuracy scores for each fold in Table 1, we can see that k-NN was more
accurate than NB in every fold. In fact, in the first and ninth folds k-NN achieved
100% accuracy. The accuracy of k-NN ranges from 95.4% to 100%, demonstrating its
excellent generalisation ability. NB, on the other hand, performed considerably worse in
terms of generalisation: even though its highest accuracy was 96.9%, its lowest was
84.6%. This shows that NB does not perform as consistently on previously unseen data as
k-NN.


Fold               k-NN accuracy (%)   Naive Bayes accuracy (%)
1                  100.0               96.9
2                  95.5                84.8
3                  96.9                90.8
4                  96.9                95.5
5                  96.9                90.8
6                  95.4                89.2
7                  95.4                89.2
8                  96.9                84.6
9                  100.0               95.4
10                 96.9                95.4
Average accuracy   97.1                91.3


Table 1 - Accuracy for each fold and the average 
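
The averages in Table 1, and the spread between the best and worst folds, can be checked
with a few lines of Python using the per-fold values listed above. The spread and the
standard deviation are extra summaries added here for illustration.

```python
from statistics import mean, pstdev

# Per-fold accuracies (%) copied from Table 1.
knn = [100.0, 95.5, 96.9, 96.9, 96.9, 95.4, 95.4, 96.9, 100.0, 96.9]
nb = [96.9, 84.8, 90.8, 95.5, 90.8, 89.2, 89.2, 84.6, 95.4, 95.4]

for name, scores in (("k-NN", knn), ("Naive Bayes", nb)):
    average = round(mean(scores), 1)
    spread = round(max(scores) - min(scores), 1)  # best fold minus worst fold
    deviation = round(pstdev(scores), 1)          # population standard deviation
    print(name, average, spread, deviation)
```

The smaller spread for k-NN (4.6 points against 12.3 for Naive Bayes) supports the
observation that k-NN was the more consistent of the two across folds.
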


Table 2 summarises the main metrics evaluated. Firstly, k-NN has an average accuracy of
97%. The average sensitivity of 98% demonstrates that k-NN is very successful at
identifying sick patients and misses only a very small number. The average specificity
is slightly lower, at 95.2%, which shows that the model is slightly worse at identifying
healthy patients; although this is not as bad as failing to spot sick patients, it can
still be problematic. The MCC achieved by k-NN is 0.933.


Naive Bayes, on the other hand, had an average accuracy of 91%. Its sensitivity, at
90.5%, is considerably worse than that of k-NN. Interestingly, Naive Bayes was more
successful at identifying healthy patients than sick ones, with a specificity of 92.8%.
Finally, NB achieved an MCC of 0.809.
                          k-NN    Naive Bayes
Average Accuracy (%)      97      91
Average Sensitivity (%)   98.0    90.5
Average Specificity (%)   95.2    92.8
Average MCC               0.933   0.809


Table 2 - Average of metrics 


Overall, it is fair to say that both algorithms achieved relatively high scores in terms
of predicting PD in patients. However, k-NN was by far the better classifier,
outscoring NB in all the metrics considered in this experiment.


7. Evaluation of experiment


This experiment had several strengths. Most importantly, the data used for training the
algorithms was properly labelled by experts, which made it possible to use supervised
learning. Additionally, the X values used in the experiment (striatum dimensions) are
commonly used by medical staff to give a clinical diagnosis. As such, the data was
already highly relevant for the diagnosis, and this is confirmed by the very high
scores.


However, the experiment had limitations. Firstly, there was an uneven distribution of
male and female subjects as well as of PD and HC subjects. As seen in Figure 15, almost
two thirds of the subjects were male (422 of 652).
[Pie chart segments: Male PD: 286; Female PD: 157; Male HC: 136; Female HC: 73]


Figure 15 - Pie Chart representing Males and Females in the database


The same can be said for the distribution of healthy controls and sick patients: there
are 443 PD patients and only 209 HC subjects. As a possible improvement, it would also
be beneficial to look at how the accuracies differ when the dimensional features are
used individually rather than together.
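
The size of the imbalance can be quantified from the group counts given in this essay.
The "always predict PD" baseline below is a thought experiment added for illustration,
not something that was run in the experiment.

```python
# Group sizes as reported for the dataset.
groups = {"Male PD": 286, "Female PD": 157, "Male HC": 136, "Female HC": 73}

total = sum(groups.values())
pd_total = groups["Male PD"] + groups["Female PD"]
male_total = groups["Male PD"] + groups["Male HC"]

# A trivial model that labels every subject as PD would already be right this often.
baseline_accuracy = pd_total / total
male_share = male_total / total

print(total, round(100 * baseline_accuracy, 1), round(100 * male_share, 1))  # 652 67.9 64.7
```

Because a model that always answers "PD" would already reach about 68% accuracy,
accuracy alone can look flattering on this dataset, which is one reason sensitivity,
specificity and MCC were reported as well.
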


Finally, the accuracies for male and female subjects were not compared separately. In
future, it would be interesting to see whether there are any notable differences in
classification accuracy between male and female subjects.


As such, for improvements, a dataset with the same number of PD and HC subjects should
be used, and perhaps the male and female subjects should be compared separately.


8. Conclusion


In conclusion, the combination of supervised machine learning algorithms and striatum
dimensional features performed well. Although the algorithms make some errors, overall
the experiment shows how they can be used to assist the clinical decision of diagnosing
Parkinson's disease. In terms of comparing Naive Bayes and k-NN, k-NN was the better
algorithm on this dataset, as confirmed by its higher classification accuracy and its
higher scores on all the other metrics used. As such, k-NN shows strong potential to be
used in a real-life scenario of diagnosing Parkinson's Disease.


9. Further research


Whilst this essay demonstrated that k-NN is the better of the two algorithms for
identifying Parkinson's disease when using striatum dimensional features as input, it
leaves many more questions to be answered. It would be interesting to see how other
supervised machine learning algorithms, such as neural networks or random forest, would
perform on the same task. This would help identify which algorithm in the supervised
learning family has the most potential. If possible, it would also be interesting to
compare how the accuracy changes if, instead of the dimension values, a real scan image
of the striatum is used as input, such as the one in Figure 1. In addition, it would be
interesting to use an unsupervised learning algorithm and see whether it can spot
patterns in this data that a human might not.


Parkinson's disease is known to be more common in males than in females [18]. It would
be interesting to see if there are any correlations between gender and the degeneration
of the striatum. Perhaps a relationship could be found between the dimensional features
of the striatum and the gender of the patient with Parkinson's disease. Whether such a
relationship exists can also be investigated using supervised machine learning
algorithms.
Code Used

# Importing necessary libraries and configurations
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
import sklearn.metrics

# Load the dataset (the file path is not legible in the source and is left elided)
data = pd.read_csv(...)
X = data.iloc[:, 8:11]  # feature columns, indexed as in the source
y = data.iloc[:, 0]     # diagnosis column (0 = HC, 1 = PD)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

# Create and fit the two classifiers
classifier_KNN = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean')
classifier_NB = GaussianNB()

classifier_NB.fit(X_train, y_train)
classifier_KNN.fit(X_train, y_train)

# 10-fold cross-validation accuracy of each model
scores = cross_val_score(classifier_KNN, X, y, cv=10)
print(scores)
print("%0.2f accuracy of KNN with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

scores1 = cross_val_score(classifier_NB, X, y, cv=10)
print(scores1)
print("%0.2f accuracy of NB with a standard deviation of %0.2f" % (scores1.mean(), scores1.std()))
print()

# Confusion matrix of k-NN built from the predictions of all 10 folds
y_pred_KNN = cross_val_predict(classifier_KNN, X, y, cv=10)
cm_KNN_CV = confusion_matrix(y, y_pred_KNN)
cm_display_KNN_CV = ConfusionMatrixDisplay(confusion_matrix=cm_KNN_CV, display_labels=[False, True])
cm_display_KNN_CV.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix of KNN")
plt.show()

# Confusion matrix of Naive Bayes built from the predictions of all 10 folds
y_pred_NB = cross_val_predict(classifier_NB, X, y, cv=10)
cm_NB_CV = confusion_matrix(y, y_pred_NB)
cm_display_NB_CV = ConfusionMatrixDisplay(confusion_matrix=cm_NB_CV, display_labels=[False, True])
cm_display_NB_CV.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix of Naive Bayes")
plt.show()


Dataset Used 


[Table: the dataset used in the experiment, laid out in repeated side-by-side column groups of Diagnosis, Width, Length and Thickness. The individual values are not recoverable from the extracted text.]
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